Appears  in  Pattern  Recognition  30(4),  April  1997,  special  issue  on  image  databases. 


Deformable Prototypes for Encoding Shape Categories in Image Databases


Stan  Sclaroff 

Computer  Science  Department 
Boston  University 
111  Cummington  St. 
Boston  MA  02215 


Abstract 

We describe a method for shape-based image database search
that uses deformable prototypes to represent categories. Rather
than directly comparing a candidate shape with all shape entries
in the database, shapes are compared in terms of the types of
nonrigid deformations (differences) that relate them to a small
subset of representative prototypes. To solve the shape
correspondence and alignment problem, we employ the technique of
modal matching, an information-preserving shape decomposition
for matching, describing, and comparing shapes despite sensor
variations and nonrigid deformations. In modal matching,
shape is decomposed into an ordered basis of orthogonal principal
components. We demonstrate the utility of this approach for
shape comparison in 2-D image databases.

Keywords: Deformable models, deformable templates, combinations
of models, shape matching, modal matching.

1  Introduction 

Shape categories can be represented as deformations from a
subset of standard or prototypical shapes; it is thought that this
is one plausible mechanism for human perception [4; 18; 30; 33;
40; 45]. This basic premise is appealing for its descriptive
parsimony, and has served as inspiration for many of the
prototype-based representations for machine vision, robotics, and
simulation.

In the work described in this paper, our aim is to represent
shape categories for interactive image database search. Rather
than directly comparing a candidate shape with all shapes in the
database, we propose a method that first indexes shapes in terms
of their relationship to a few shape prototypes. To do this, we
will employ modal matching, a deformable shape decomposition
that allows users to specify a few example shapes and have the
computer efficiently sort the set of objects based on the similarity
of their shape. If desired, shapes can be more closely compared
in terms of the types of nonrigid deformations (differences) that
relate them to a few prototype shapes.

Our approach is related to morphing, a computer graphics
technique that has become quite popular in advertisements.
Morphing is accomplished by an artist identifying a large number of
corresponding control points in two images, and then incrementally
deforming the geometry of the first image so that its control
points eventually lie atop the control points of the second
image. Using this technique, in-between or novel views can be
generated as warps between example views. This suggests an
important way to obtain a low-dimensional, parametric description
of shape: interpolate between known, prototype views. For
instance, given views of the extremes of a motion (e.g., systole
and diastole, or left-leg forward and right-leg forward) we can
describe the intermediate views as a smooth combination of the
extremal views.

All that is required to determine this view-based
parameterization of a new shape is: the prototype views, point
correspondences between the new shape and the prototype views,
and a method of measuring the amount of (nonrigid) deformation
that has occurred between the new shape and each prototype
view. The prototypes define a polytope in the space of the
(unknown) underlying physical system's parameters. By measuring
the amount of deformation between the new shape and extremal
views, we locate the new shape in the coordinate system defined
by the polytope. This coordinate in prototype space can be used
for database indexing and fast search.

This general approach is related in spirit to the
linear-combinations-of-views paradigm of Ullman and Basri [56] and
Poggio, et al. [44], where any object view can be synthesized as a
combination of linearly-warped example views. However, it
differs from their proposals in two important ways. First, we are
interested not only in recognizing shapes, but also in describing
the types of deformations that relate them. We want to derive a
low-dimensional parametric representation of the shape that can
be used to recognize and compare shapes, in the manner of
Darrell and Pentland [12]. Second, we cannot be restricted to a
linear framework. Nonrigid motions are inherently nonlinear,
although they are often “physically smooth.” Therefore, to employ
a combination-of-views approach we must be able to determine
point correspondences and measure similarities between views in
a way that takes into account at least qualitative physics of
nonrigid shape deformation. In computer graphics it is the job of
the artist to enforce the constraint of physical smoothness; in
machine vision, we need to be able to do the same automatically.

To achieve this, we will employ modal matching, a method
for (1) determining point correspondences using an energy-based
model, (2) warping or morphing one shape into another using
energy-based interpolants, and (3) measuring the amount of
deformation between an object's shape and prototype views [48; 50].

Figure 1: The data needed to build two deformable prototype shape
models. Support maps are shown in (b,e) and edge maps in (c,f). The
two prototype shape models depict (a) a European Hare, and (d) a Desert
Cottontail. The original color images were digitized from [1].

Figure 2: When a new shape is encountered, it is parameterized in terms
of the energy needed to nonrigidly deform the prototype shape models
into alignment with the new shape. The distance to prototypes is
expressed as the square root of strain. The resulting tuple (10.1, 5.7) is
used to represent the shape in a space defined in terms of distance to
prototypes.

Figure 1 shows the information required to build modal models
for the two rabbit shape prototypes employed in our image
database experiments. In our system a shape is defined by:
a cloud of feature locations (i.e., edges, corners, high-curvature
points) and a region of support that tells us where the shape is.
Given this input, deformable prototype models are built directly
from feature data, using a finite element formulation that is based
on Gaussian interpolants [50]. For efficiency, we can select a
subset of the feature data as nodes for a lower-resolution finite
element model and then use the resulting eigenmodes in finding the
higher-resolution feature correspondences as described in [50].
This subset can be a set of particularly salient features (i.e.,
corners, T-junctions, and edge mid-points) or a randomly selected
subset of (roughly) uniformly-spaced features.


Figure 3: Scatter plot of square-root modal strain energy for rabbit
prototypes used in the image database experiment. Each axis depicts the
square root of strain energy needed to align a shape with a rabbit
prototype. Thus each rabbit shape has a coordinate in this space. The
rabbits are clustered in terms of their 2-D shape appearance: long-legged,
standing rabbits cluster at the top-left of the graph, while short-legged,
seated rabbits cluster at the bottom right. There are two rabbits that
map between clusters, showing the smooth ordering from long-legged,
to medium-legged, to short-legged rabbits in this view-space.


When a new shape is encountered, it is parameterized in terms
of the energy needed to nonrigidly deform the prototype shape
models into alignment with the new shape. Similarity is thus
computed in terms of the amount of strain energy needed to
deform each prototype to match it to the candidate shape, as
illustrated in Figure 2. The amounts of deformation are measured
in terms of strain energy and stored as an n-tuple, where n is the
number of prototype shapes employed. In this case, the resulting
tuple is (10.1, 5.7). The result is a low-dimensional parametric
representation that can be used for efficient shape-based image
database search. Rather than directly comparing a candidate
shape with all shape entries in a database, we instead compute
similarity in a distance-to-prototypes space. Using this method,
we compactly represent a category of shapes in terms of a few
prototype views.

Fig. 3 shows a scatter plot of the two-dimensional “rabbit
space” spanned by two rabbit shape prototypes. The graph's
x-axis depicts the square root of strain energy needed to align the
European Hare prototype with each rabbit shape, while the y-axis
shows the square root of strain energy needed to align the Desert
Cottontail prototype with each rabbit shape. Each of the 12 rabbit
shapes has a coordinate in this strain-energy-from-prototype
subspace. As can be seen, the rabbits are clustered in terms of
their 2-D shape appearance: long-legged, standing rabbits cluster
at the top-left of the graph, while short-legged, seated rabbits
cluster at the bottom right. There are two rabbits that map between
clusters, showing the smooth ordering from long-legged, to
medium-legged, to short-legged rabbits in this view-space.

We will demonstrate the utility of this approach for comparing
shapes in 2-D image databases digitized from children's field
guides and images of hand tools. Deformable shape models will
be built and compared using support and silhouette data. The
methods described in this paper are also useful for recognizing or
classifying motions [49], fusing data from different sensors, and
for comparing data acquired at different times or under different
conditions [50].

1.1  Segmentation 

The work reported in this paper addresses issues of shape
categorization, even when shapes within categories can undergo both
rigid and nonrigid motion. Throughout this paper, it is
assumed that figure/ground segmentation information can be
provided as input to the modal shape comparison modules. Since
segmentation is not the topic of this paper, our current databases
contain images of unoccluded objects on uniform backgrounds.
Under these circumstances, a c-Means clustering and
thresholding technique can be used for foreground/background
separation [31]. However, for very general query by shape,
foreground/background modules will be needed as a front-end to the
system. One solution would be to use motion and color to
pull out foreground objects. Such figure-ground segmentation
can be done reliably by use of clustering in conjunction with
optical flow [5; 13; 54; 57; 58] and/or color information [8; 24; 29;
32; 37; 31].

2  Background  and  Notation 

In the last few years researchers have made some progress
toward automatic shape indexing for image databases. The
general approach has been to calculate some approximately invariant
statistic like shape moments, and use these to stratify the image
database [9; 26; 27; 34; 36; 47].

One problem with this general approach is that it discards
significant perceptual and semantic information. While indexing
methods provide a means to quickly narrow a search to a more
manageable subset, they often do not provide a method for closer,
direct comparison of how shapes are related. Rather than
discarding useful similarity information by employing only invariants,
we believe that one should use a decomposition that preserves as
much semantically meaningful and perceptually important
information as is possible, while still providing an efficient encoding
of the original signal [42].

Another important problem with these approaches is that most
are only robust for rigid shapes. Although many things move
rigidly, in many cases this rigid-body model is inadequate. For
instance, most biological objects are flexible and articulated. To
describe these deformations, therefore, it is reasonable to model
the physics by which real objects deform. This rationale led to the
physical modeling paradigm of active contours or snakes [28] and
deformable templates [52; 59]. A snake has a predefined structure
which incorporates knowledge about the shape and its resistance
to deformation. By allowing the user to specify forces that are a
function of sensor measurements, the intrinsic dynamic behavior
of a physical model can be used to solve fitting, interpolation, or
correspondence problems.

While snakes enforced constraints on smoothness and the
amount of deformation, they could not in their original form be
used to constrain the types of deformation valid for a particular
problem domain or object class. This led to the development of
algorithms which include a priori constraints on the types of
allowable deformations for motion tracking [6; 7; 10; 16].

Cootes et al. [11; 3] use trainable snakes for capturing the
invariant properties of a class of shapes, by finding the principal
variations of a snake via the Karhunen-Loeve transform.
Unfortunately, this method relies on the consistent sampling and
labeling of point features across the entire training set and cannot
handle large rotations. If different feature points are present in
different views, or if there are very different sampling densities,
then the resulting models will differ even if the object's pose and
shape are identical.

Keeping these issues in mind, we use the Finite Element
Method to alleviate problems with sampling, and modal analysis
to provide a principled way to select the types of nonrigid
deformations needed for flexibly describing shape. In the rest of this
section we provide a brief review of our representation. In
addition, we review our new method of building FEM models
without imposing an a priori parameterization, and how to use the
modes of this model to find point correspondences, to align
objects, and to compare their shape. This initial work was applied in
the area of finding corresponding features in static imagery [50]
and serves as the foundation for our new representation for shape
categories.

2.1  Finite  Element  Method 

The major advantage of the finite element method is that it uses
the Galerkin method of surface interpolation. This provides an
analytic characterization of shape and elastic properties over the
whole surface, and thereby alleviates problems caused by
irregular sampling of feature points. In Galerkin's method, we set up
a system of polynomial shape functions that relate the
displacement of a single point to the relative displacements of all the other
nodes of an object:

u(x) = H(x)U (1)

where H is the interpolation matrix, x is the local coordinate of
a point in the element where we want to know the displacement,
and U denotes a vector of displacement components at each
element node. By using these functions, we can calculate the
deformations which spread uniformly over the body as a function of
its constitutive parameters.

Solution to the problem of deforming an elastic body to match
the set of feature points then requires solving the dynamic
equilibrium equation:

$M\ddot{U} + D\dot{U} + KU = R,$ (2)

where R is the load vector whose entries are the spring forces
between each feature point and the body surface, and where M,
D, and K are the element mass, damping, and stiffness matrices,
respectively [2; 43].

2.2  Modal  Representation 

The FEM governing equations can be decoupled by posing the
equations in a basis defined by the M-orthogonalized eigenvectors
of K. These eigenvectors and values are the solution to the
generalized eigenvalue problem:

$K\phi_i = \omega_i^2 M\phi_i.$ (3)

The vector $\phi_i$ is called the ith mode shape vector and $\omega_i$ is the
corresponding frequency of vibration. Each mode shape vector
describes how each node is displaced by the ith vibration
mode. The mode shape vectors are M-orthonormal; this means
that $\Phi^T K \Phi = \Omega^2$ and $\Phi^T M \Phi = I$. The $\phi_i$ form columns in
the transform $\Phi$ and $\omega_i^2$ are elements of the diagonal matrix $\Omega^2$.
We will assume Rayleigh damping (i.e., $D = a_0 M + a_1 K$), thus
the damping matrix will also be diagonalized by this transform
[2].
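
As an illustration of Eq. 3, the following minimal sketch computes M-orthonormal mode shapes for a small system. It is not the implementation used in our system: the function name `modal_basis` and the toy 3-node matrices are hypothetical, and the reduction to a standard symmetric eigenproblem via a Cholesky factor of M is simply one common way to solve the generalized problem.

```python
import numpy as np

def modal_basis(K, M):
    """Solve K phi_i = omega_i^2 M phi_i for symmetric K and
    symmetric positive-definite M.

    Returns (omega_sq, Phi): eigenvalues in ascending order and a
    matrix whose columns are M-orthonormal mode shape vectors.
    """
    L = np.linalg.cholesky(M)          # M = L L^T
    L_inv = np.linalg.inv(L)
    A = L_inv @ K @ L_inv.T            # equivalent standard eigenproblem
    A = 0.5 * (A + A.T)                # remove round-off asymmetry
    omega_sq, Y = np.linalg.eigh(A)    # ascending eigenvalues
    Phi = L_inv.T @ Y                  # map back: phi = L^{-T} y
    return omega_sq, Phi

# Toy 3-node system (hypothetical values, not taken from the paper).
K = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
M = np.diag([1.0, 2.0, 1.0])

omega_sq, Phi = modal_basis(K, M)
```

With this normalization, `Phi.T @ M @ Phi` is the identity and `Phi.T @ K @ Phi` is the diagonal matrix of squared frequencies, which are the two orthonormality conditions stated above.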

This generalized coordinate transform $\Phi$ is then used to
transform between nodal point displacements $U$ and decoupled modal
displacements $\tilde{U}$, $U = \Phi\tilde{U}$. We can now rewrite Eq. 2 in terms
of these generalized or modal displacements, obtaining a
decoupled system of equations:

$\ddot{\tilde{U}} + \tilde{D}\dot{\tilde{U}} + \Omega^2\tilde{U} = \Phi^T R,$ (4)

allowing for closed-form solution to the equilibrium problem
[43]. Given this equilibrium solution in the two images, point
correspondences can be obtained directly.

By discarding high frequency modes the amount of
computation required can be minimized without significantly altering
correspondence accuracy. Moreover, such a set of modal
amplitudes provides a robust, canonical description of shape in terms
of deformations applied to the original elastic body. This allows
them to be used directly for object recognition [43].
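
To make the decoupling and mode truncation concrete, the sketch below solves a static equilibrium problem in modal coordinates. It rests on several simplifying assumptions that are ours for illustration only: the static case (velocity and acceleration terms dropped), a unit mass matrix so that the modes are ordinary eigenvectors of K, and a positive-definite toy stiffness matrix with no rigid-body modes. The function name `modal_static_solve` is hypothetical.

```python
import numpy as np

def modal_static_solve(K, R, n_modes):
    """Approximate the solution of K U = R using only the n_modes
    lowest-frequency modes.

    Assumes M = I and a positive-definite K, so every omega_i^2 > 0.
    Returns (U, u_modal): nodal displacements and modal amplitudes.
    """
    omega_sq, Phi = np.linalg.eigh(K)      # eigenvalues ascending
    Phi_n = Phi[:, :n_modes]               # keep low-order modes only
    # Decoupled equations: omega_i^2 * u_i = phi_i^T R, one per mode.
    u_modal = (Phi_n.T @ R) / omega_sq[:n_modes]
    return Phi_n @ u_modal, u_modal

# Toy 4-node chain with hypothetical loads.
K = np.array([[ 2.0, -1.0,  0.0,  0.0],
              [-1.0,  2.0, -1.0,  0.0],
              [ 0.0, -1.0,  2.0, -1.0],
              [ 0.0,  0.0, -1.0,  2.0]])
R = np.array([1.0, 0.0, 0.0, 0.5])

U_full, _ = modal_static_solve(K, R, 4)        # all modes: exact solution
U_low, amps_low = modal_static_solve(K, R, 2)  # two modes: smooth approximation
```

Using all modes reproduces the direct solution of the coupled system, while the truncated solve describes the same deformation with only two amplitudes.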

2.3  Modal  Matching 

Perhaps the major limitation of previous methods is the
requirement that every object be described as the deformations of a
single prototype object. For instance, in some schemes all shapes
are represented as deformations from an elliptical or circular
prototype [9; 20; 43]. Such approaches implicitly impose an a
priori parameterization upon the sensor data, and therefore
implicitly determine the correspondences between data and prototype.


Figure 4: Modal matching system diagram (reprinted from [48]). Input:
features, which are used as FEM nodes. Output: strongest feature
correspondences.

Furthermore, an elliptical prototype may be inadequate for many
shapes, especially shapes that are not star-connected, or those that
have long protrusions or deep concavities. We would like to avoid
these problems as much as possible, by letting the data determine
the parameterization in a natural manner. To accomplish this we
use the data itself to define the deformable object, by building
stiffness and mass matrices that use the positions of image
feature points as the finite element nodes.

The resulting new modeling formulation is called modal
matching, and is described in detail in [48; 50]. A flow-chart of
our method is shown in Fig. 4. For each image we start with
feature point locations, which are used as nodes in building a finite
element model of the shape. If we are given a support function,
then we can “cut” the finite element sheet into any shape. We do
this by defining a support function that is zero anywhere outside
the shape region, and greater than zero inside the shape region.
Thus the support function can be used to define both the shape
and the thickness of the elastic model. A Gaussian is then
centered at each node. Together, these Gaussians form a basis for
building the Galerkin interpolants of Eq. 1, and are thus used in
constructing FEM mass and stiffness matrices.

When there are possibly hundreds of feature points for each
shape, computing the FEM model and eigenmodes for the full
feature set can become non-interactive. For efficiency, we can
select a subset of the feature data to build a lower-resolution finite
element model and then use the resulting eigenmodes in finding
the higher-resolution feature correspondences as described in
[50]. This subset can be a set of particularly salient features (i.e.,
corners, T-junctions, and edge mid-points) or a randomly selected
subset of (roughly) uniformly-spaced features.

We then compute the modes of free vibration $\Phi$ of this model
using Eq. 3. The modes of an object form an orthogonal
object-centered coordinate system for describing feature locations. That
is, each feature point location can be uniquely described in terms
of how it projects onto each eigenvector, i.e., how it participates
in each deformation mode. The transform between Cartesian
feature locations (x, y) and modal feature locations (u, v) is
accomplished by using the eigenvectors $\Phi$ as a coordinate basis:

$\Phi = [\phi_1 | \ldots | \phi_{2m}] = \begin{bmatrix} u_1^T \\ \vdots \\ u_m^T \\ v_1^T \\ \vdots \\ v_m^T \end{bmatrix}$ (5)

where m is the number of nodes used to build the finite element
model. The column vector $\phi_i$ is the ith mode shape, and describes
the modal displacement (u, v) at each feature point due to the
ith mode, while the row vectors $u_i^T$ and $v_i^T$ are the ith generalized
feature vectors, which together describe the feature's location in
the modal coordinate system.

Normally only the n lowest-order modes are used in forming
this coordinate system, so that (1) we can compare objects with
differing numbers of feature points, and (2) ensure that the feature
point descriptions are insensitive to noise. Depending upon the
demands of the application, we can also selectively ignore
rigid-body modes, or low-order projective-like modes, or modes that
are primarily local. Consequently, we can match, describe, and
compare nonrigid objects in a very flexible and general manner.

Point correspondences can now be determined by comparing
the two groups of generalized feature vectors. The important
idea here is that the low-order vibration modes computed for two
similar objects will be very similar, even in the presence of
affine deformation, nonrigid deformation, local shape
perturbation, noise, or small occlusions. The points that have the most
similar and unambiguous coordinates are then matched, with
the remaining correspondences determined by using the
physical model as a smoothness constraint [48; 50]. Currently, the
algorithm has the limitation that it cannot reliably match largely
occluded or partial objects.

2.4  Recovering  Modal  Descriptions

Given that modal models have been computed for both shapes,
and that correspondences have been established, we can solve
for the modal displacements directly, if correspondence is
known at all nodes. Unfortunately, correspondence is not usually
available at all nodes, and our recovery problem becomes
underconstrained. Since the modal matching algorithm computes the
strength for each matched feature, we would also like to utilize
these match-strengths directly in alignment. As detailed in [50],
we can obtain a constrained weighted least squares solution, if
we minimize alignment error that includes a modal strain energy
term $\lambda\Omega^2$:

$\tilde{U} = [\Phi^T W^2 \Phi + \lambda\Omega^2]^{-1} \Phi^T W^2 U$ (6)

where entries of the diagonal weighting matrix W are inversely
proportional to the affinity measure for each feature match. The
entries for unmatched features are set to zero. The strain term
$\lambda\Omega^2$ directly parallels the smoothness functional employed in
regularization [53]. This measure allows us to incorporate some
prior knowledge about how “stretchy” the shape is, how much it
resists compression, etc. Unmatched nodes are thus free to move in
a manner consistent with the material properties and the forces at
the matched nodes.
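
The strain-regularized weighted least squares solve of Eq. 6 can be sketched in a few lines. This is an illustrative sketch only: the function name `recover_modal_amplitudes`, the toy stiffness matrix, the unit mass matrix, and the particular weights and value of λ are all assumptions made for the example, and the weights are supplied directly rather than derived from an affinity measure.

```python
import numpy as np

def recover_modal_amplitudes(Phi, omega_sq, U, weights, lam):
    """Evaluate Eq. 6: [Phi^T W^2 Phi + lam * Omega^2]^{-1} Phi^T W^2 U.

    weights holds the diagonal of W (zero for unmatched nodes) and
    omega_sq holds the diagonal of Omega^2.
    """
    W2 = np.diag(np.asarray(weights, dtype=float) ** 2)
    A = Phi.T @ W2 @ Phi + lam * np.diag(omega_sq)
    b = Phi.T @ W2 @ U
    return np.linalg.solve(A, b)   # solve rather than form the inverse

# Toy model: modes of a 4-node chain with unit mass (hypothetical).
K = np.array([[ 2.0, -1.0,  0.0,  0.0],
              [-1.0,  2.0, -1.0,  0.0],
              [ 0.0, -1.0,  2.0, -1.0],
              [ 0.0,  0.0, -1.0,  2.0]])
omega_sq, Phi = np.linalg.eigh(K)

true_amps = np.array([1.0, -0.5, 0.0, 0.0])
U = Phi @ true_amps                # nodal displacements to explain

# All nodes matched, no regularization: amplitudes recovered exactly.
exact = recover_modal_amplitudes(Phi, omega_sq, U, [1, 1, 1, 1], lam=0.0)

# Last node unmatched (weight 0): the strain term keeps the system solvable.
partial = recover_modal_amplitudes(Phi, omega_sq, U, [1, 1, 1, 0], lam=1e-3)
```

In the second call the data term alone is rank-deficient, so it is the λΩ² term that determines how the unmatched node moves.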

3  Encoding  Modal  Shape  Categories 

We  will  now  describe  how  to  use  modal  models  to  encode 
shape  categories.  One  key  advantage  in  using  such  a  prototype- 
based  approach  is  that  of  data  reduction:  given  the  multitude  of 
possible  viewpoints  and  configurations  for  an  object,  we  need  to 
reduce  this  multitude  down  to  a  more  efficient  representation  that 
requires  only  a  few  characteristic  views.  Shapes  are  compared  in 
terms  of  their  relative  distances  to  prototypes,  rather  than  directly 
compared  with  one  another. 


Given point correspondences between two shapes, we can then
determine the deformations required to align them. An important
benefit of modal matching is that the eigenmodes computed for
the correspondence algorithm can also be used to describe the
rigid and non-rigid deformation needed to align one object with
another. Once this modal description has been computed, we can
compare shapes simply by looking at their mode amplitudes or,
since the underlying model is energy-based, we can compute
and compare the amount of deformation energy needed to
align an object, and use this as a similarity measure. If the strain
energy required to align two feature sets is relatively small, then
the objects are very similar.

Our task is to recover the modal deformation parameters $\tilde{U}$ that
take the set of points from the first image to the corresponding
points in the second. A number of different methods for
recovering the modal deformation parameters are described in [48; 50].
We will only give an overview of the strain-minimizing
least-squares method employed for the database experiments described
in this paper.


3.1  Distance  Measures 

Once the mode deformation parameters $\tilde{U}$ have been
recovered, we can compute the strain energy incurred by these
deformations, and use this as a similarity metric. In general, we will
want to compare the strain only in a subset of modes S that has
been deemed important in measuring similarity:

$\delta(A, B) = \frac{1}{2} \sum_{i \in S} \tilde{u}_i^2 \omega_i^2$ (7)

where the modal displacements $\tilde{u}_i$ describe the deformation
needed to align shape A with shape B. It may be desirable to
make object comparisons rotation and/or position independent.
To do this, we ignore displacements in the rigid body modes,
thereby disregarding differences in position and orientation. In
addition, we can make our comparisons robust to noise and local
shape variations by discarding higher-order modes. This modal
selection technique is also useful for its compactness, since we
can describe deviation from a prototype in terms of relatively few
modes.

If a metric distance function is desired, then this simple energy
measure needs to be modified: strain does not satisfy one of the
three axioms for a metric space [55]. These three axioms are:

1. minimality: $\delta(A, B) \geq \delta(A, A) = 0$,

2. symmetry: $\delta(A, B) = \delta(B, A)$, and

3. triangle inequality: $\delta(A, B) + \delta(B, C) \geq \delta(A, C)$.
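
The three axioms can be checked mechanically on any table of pairwise dissimilarities. The sketch below is a small utility of our own devising (the name `check_metric_axioms` and the numbers in the table are hypothetical, not measurements from our experiments); the table is deliberately asymmetric so that one axiom fails.

```python
def check_metric_axioms(d, tol=1e-9):
    """Check the three metric axioms on a square table d[i][j] of
    pairwise dissimilarities. Returns one boolean per axiom."""
    idx = range(len(d))
    # minimality: d(A, B) >= d(A, A) = 0
    minimality = (all(abs(d[i][i]) <= tol for i in idx)
                  and all(d[i][j] >= -tol for i in idx for j in idx))
    # symmetry: d(A, B) = d(B, A)
    symmetry = all(abs(d[i][j] - d[j][i]) <= tol
                   for i in idx for j in idx)
    # triangle inequality: d(A, B) + d(B, C) >= d(A, C)
    triangle = all(d[i][j] + d[j][k] >= d[i][k] - tol
                   for i in idx for j in idx for k in idx)
    return {"minimality": minimality,
            "symmetry": symmetry,
            "triangle": triangle}

# Hypothetical dissimilarities among three shapes; d[0][1] != d[1][0].
d = [[0.0, 1.0, 2.0],
     [1.5, 0.0, 1.0],
     [2.0, 1.0, 0.0]]

report = check_metric_axioms(d)
```

For this table the check reports that minimality and the triangle inequality hold while symmetry does not.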

While the strain energy measure satisfies minimality and the
triangle inequality, it does not satisfy symmetry. The strain
energy is not symmetric for shapes of differing sizes; i.e., if the
scales of two objects A and B differ, then the strain energy needed
to align A with B may differ from that needed to align B with
A. The difference in strain will be inversely proportional to the
difference in square of the object scales. Therefore, when
comparing objects of differing scales we divide strain energy by the
shape's area. When a support map is available, this area can be
computed directly. In the infinite-support case, the area can be
approximated by computing the minimum bounding circle, or the
moments, for the data.

There is an additional property that proves useful in
defining a metric space, segmental additivity: $\delta(A, B) + \delta(B, C) =
\delta(A, C)$, if B is on the line between A and C. To satisfy
segmental additivity, we can take the square root of the strain energy:

$\delta(A, B) = \sqrt{\frac{1}{2a} \sum_{i \in S} \tilde{u}_i^2 \omega_i^2}$ (8)

where a is the shape's area. This results in a weighted distance
metric not unlike the Mahalanobis distance: the modal
amplitudes are decoupled, each having a “variance” that is inversely
proportional to the mode's eigenvalue. As a result, this
formulation could be used as part of a regularized learning scheme in
which the initial covariance matrix, $\Omega^{-2}$, is iteratively updated to
incorporate the observed modal parameter covariances along the
lines of [19; 17; 44; 38; 39].
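
A minimal sketch of the two distance measures follows, assuming that the strain energy over a mode subset is half the sum of squared modal amplitudes weighted by the squared frequencies (Eq. 7), and that Eq. 8 is its area-normalized square root. The function names and all numerical values are hypothetical and chosen only for illustration.

```python
import math

def strain_energy(u, omega_sq, modes):
    """Strain energy restricted to a subset of modes:
    0.5 * sum over i in modes of u[i]^2 * omega_sq[i]."""
    return 0.5 * sum(u[i] ** 2 * omega_sq[i] for i in modes)

def strain_distance(u, omega_sq, modes, area):
    """Square root of the area-normalized strain energy."""
    return math.sqrt(strain_energy(u, omega_sq, modes) / area)

# Hypothetical modal amplitudes and squared frequencies for four modes.
u = [3.0, 0.5, 0.2, 0.1]
omega_sq = [0.0, 4.0, 9.0, 100.0]

# Ignore the rigid-body mode (index 0) and the highest-order mode (index 3).
modes = [1, 2]

energy = strain_energy(u, omega_sq, modes)            # 0.5 * (1.0 + 0.36)
dist = strain_distance(u, omega_sq, modes, area=2.0)

# Doubling every amplitude doubles the distance but quadruples the energy.
u_twice = [2.0 * x for x in u]
dist_twice = strain_distance(u_twice, omega_sq, modes, area=2.0)
```

The last two lines show why the square root is taken: the distance grows linearly with the amount of deformation, whereas the raw energy grows quadratically.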

3.2  Modal  Shape  Prototypes 

Instead of looking at the strain energy needed to align the two
shapes, we wish to compare the mode amplitudes needed to align
a third, prototype object C with each of the two objects. In this
case, we first compute two modal descriptions $\tilde{U}_a$ and $\tilde{U}_b$ that
align the prototype with each candidate object. We then utilize
our strain-energy distance metric to order the objects based on
their similarity to that prototype.

We  can  use  distance  to  prototypes  to  define  a  low-dimensional 
space  for  efficient  shape  comparison.  In  such  a  scenario,  a  few 
prototypes  are  selected  to  span  the  variation  of  shape  within  each 
category.  Every  shape  in  the  database  is  then  aligned  with  each 
of  the  prototypes  using  modal  matching,  and  the  resulting  modal 
strain energy is stored as an n-tuple T, where n is the number of
prototypes. Each shape in the database now has a coordinate in
this  “strain-energy-from-prototypes”  space;  shapes  can  be  com¬ 
pared  simply  in  terms  of  their  Euclidean  distance  in  this  space. 
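
As a rough illustration of search in this space, the sketch below ranks database shapes by Euclidean distance between their n-tuples. The array layout (one row per shape, one column per prototype) and the function name `rank_by_prototype_distance` are assumptions made for the example.

```python
import numpy as np

def rank_by_prototype_distance(strain_tuples, query_index):
    """Rank shapes by Euclidean distance in strain-from-prototypes space.

    strain_tuples: (num_shapes, n) array; row s holds the strain
    energy from each of the n prototypes to shape s.
    Returns shape indices, most similar first, query excluded.
    """
    T = np.asarray(strain_tuples, dtype=float)
    dists = np.linalg.norm(T - T[query_index], axis=1)
    order = np.argsort(dists, kind="stable")
    return [int(i) for i in order if i != query_index]
```

Each query costs one pass over the stored tuples, so no modal matching is needed at search time.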

We  have  used  strain  energy  for  most  of  our  object  comparison 
experiments,  since  it  has  a  convenient  physical  meaning;  how¬ 
ever,  we  suspect  that  it  may  sometimes  be  necessary  to  weigh 
higher-frequency  modes  less  heavily,  since  these  modes  typically 
only  describe  high-frequency  shape  variations  and  are  more  sus¬ 
ceptible  to  noise.  For  instance,  we  could  directly  measure  dis¬ 
tances  between  modal  descriptions,  U.  Our  preliminary  experi¬ 
ments  in  prototype-based  shape  description  have  shown  that  this 
metric  yields  comparable  performance  to  the  strain  energy  met¬ 
ric. 

3.3  Spanning  Categories  with  Prototypes 

In  our  current  image  database  system,  a  human  operator  selects 
a  few  example  shapes  that  approximately  span  each  category.  Our 
system  performance  is  therefore  dependent  on  the  user's  ability 
to  select  an  adequately  diverse  and  sufficient  set  of  prototypes.  It 
may  be  desirable  to  have  a  system  that  could  automatically  select 
prototypes in an unsupervised fashion. An unsupervised learning
or clustering method (e.g., k-means, hierarchical clustering,
iterative optimization, Bayes classifiers) could be adapted for
automatically selecting the prototype shapes based on modal
matching and modal strain. Using such methods introduces a
tradeoff, because for many pattern classification and learning
schemes it is critical that training data sets be large and diverse
enough to characterize the variations within a particular shape
class [14]. This shifts the pressure from a human selecting
adequate prototypes to a human providing a sufficiently diverse
and large training data set (and providing the number of
categories present). Finding these
clusters  without  prototypes  would  (in  general)  require  matching 
all  shapes  to  all  other  shapes  before  optimal  clusters  could  be  ob¬ 
tained.  In  either  case,  qualities  missing  from  either  the  training 
data  or  the  prototypes  may  be  ignored  or  misinterpreted. 

Another  issue  is  orthogonality.  It  is  unlikely  that  the  selected 
shape  prototypes  will  describe  orthogonal  axes  in  some  idealized 
category  space.  To  ensure  orthogonality  we  have  employed  a 
method  based  on  finding  the  principal  components.  Given  a  set  of 
prototypes,  we  compute  the  strain-to-prototypes  feature  vector  T 
and  its  covariance  matrix  for  a  randomly  selected  subset  of  shapes 
in the database. The eigenvectors $\Psi$ of the covariance matrix
are used to transform all T into new coordinates in an
orthogonalized parameter space:

$$T' = \Lambda^{-1/2} \Psi^{T} T, \qquad (9)$$

where $\Lambda$ is a diagonal matrix containing the eigenvalues.
Computing distances in this new space is equivalent to computing
the Mahalanobis distance in the original strain-to-prototypes
space. As before, variation orthogonal to the space spanned by the
training set will not be represented. This may at first seem like
a limitation; however, this property can be exploited to constrain
the allowable deformations to only those that are statistically most
likely.  Furthermore,  principal  components  with  eigenvalues  less 
than  a  threshold  can  be  discarded  to  gain  a  lower-dimensional 
parameter  space  as  well  as  better  robustness  to  noise  [21]. 
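
The orthogonalization step can be sketched as follows, assuming the transform scales each principal axis by the inverse square root of its eigenvalue, which is what makes Euclidean distance in the new space equal to Mahalanobis distance in the old one. The function name `orthogonalize` and the `min_eigenvalue` threshold parameter are hypothetical.

```python
import numpy as np

def orthogonalize(T_train, T_all, min_eigenvalue=1e-12):
    """Map strain-to-prototypes vectors into whitened PCA coordinates.

    T_train: (k, n) sample of feature vectors used for the covariance.
    T_all:   (s, n) feature vectors to transform (one per row).
    Components with eigenvalue <= min_eigenvalue are discarded.
    """
    T_train = np.asarray(T_train, dtype=float)
    T_all = np.asarray(T_all, dtype=float)
    cov = np.atleast_2d(np.cov(T_train, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric eigendecomposition
    keep = eigvals > min_eigenvalue              # drop weak components
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    # Row-vector form of: T' = eigvals^(-1/2) * eigvecs^T * T
    return (T_all @ eigvecs) / np.sqrt(eigvals)
```

Raising `min_eigenvalue` is one way to realize the dimensionality reduction mentioned above; the appropriate threshold would depend on the data.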

The  transform  to  the  orthogonalized  parameter  space  is  done 
as  a  precomputation  (prior  to  repeated  database  search).  The 
method  has  been  tested  in  experiments  with  a  database  of  hand- 
tool  images,  as  will  be  detailed  in  Section  4. 

3.4  Comparison  Without  Direct  Correspondence 
Computation 

In  an  alternative  method,  we  can  measure  the  distance  between 
modes and determine how similar two shapes' modes are without
actually  computing  feature  correspondences.  This  can  be  accom¬ 
plished  by  measuring  the  Hausdorff  distance  between  the  low- 
order  mode  vectors  of  the  new  shape  and  the  low-order  modes  of 
a  prototype. 

Given two mode vectors $\phi_a$ and $\phi_b$, one from the first
shape A and another from the second shape B, we can define the
Hausdorff distance between these mode vectors as

$$H(\phi_a, \phi_b) = \max\big(h(\phi_a, \phi_b),\, h(\phi_b, \phi_a)\big), \qquad (10)$$

where $h(\phi_a, \phi_b)$ is the directed Hausdorff distance from
$\phi_a$ to $\phi_b$:

$$h(\phi_a, \phi_b) = \max_{\phi_{a,i} \in \phi_a} \; \min_{\phi_{b,j} \in \phi_b} \| \phi_{a,i} - \phi_{b,j} \|. \qquad (11)$$

The distance norm is taken between generalized features of Eq. 5:
$(u_{a,i}, v_{a,i})$ and $(u_{b,j}, v_{b,j})$. In our experiments,
we have used a Euclidean norm. This measure requires no specific
correspondence between points on the two objects.
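
A minimal Python sketch of the directed and symmetric Hausdorff distances with a Euclidean norm is given below. Each mode vector is assumed to be stored as an array with one row per generalized feature; the function names are hypothetical.

```python
import numpy as np

def directed_hausdorff(A, B):
    """Directed Hausdorff distance from point set A to point set B.

    A: (p, d) array, B: (q, d) array; rows are generalized features.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Pairwise Euclidean distances, shape (p, q).
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # For each feature in A take its nearest feature in B, then the worst case.
    return float(d.min(axis=1).max())

def hausdorff(A, B):
    """Symmetric Hausdorff distance: the larger of the two directed ones."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

The brute-force pairwise computation costs O(pq) per pair of modes, which is small for the 60-80 nodes used in the experiments below.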

In this formulation, we match and compare modes; consequently,
for each shape in the database we tally the number of modes that
match each prototype's modes. Typically, mode distances are
computed for only the lowest-order 25% or fewer of the nonrigid
modes. These tallies can be stored as coordinates in an
n-dimensional similarity space; thus, shape dissimilarity is
proportional to the Euclidean distance in this space.
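
One possible reading of the tallying step is sketched below. It assumes that a shape mode "matches" a prototype when its Hausdorff distance to at least one of that prototype's low-order modes falls within a tolerance; this matching rule, the precomputed distance matrices, and the function names are assumptions for illustration only.

```python
import numpy as np

def count_matching_modes(mode_distances, tolerance):
    """Tally shape modes that come within tolerance of a prototype mode.

    mode_distances[i, j]: Hausdorff distance between low-order mode i
    of the shape and low-order mode j of the prototype.
    """
    D = np.asarray(mode_distances, dtype=float)
    return int(np.sum(D.min(axis=1) <= tolerance))

def similarity_coordinates(distances_per_prototype, tolerance):
    """One tally per prototype: a coordinate in mode-similarity space."""
    return np.array([count_matching_modes(D, tolerance)
                     for D in distances_per_prototype])
```

Under this reading, a zero coordinate marks a shape that has no modes within tolerance of that prototype.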

Finally,  if  two  shapes  have  no  modes  falling  within  the  reason¬ 
able  tolerance  for  similarity,  then  the  shapes  will  be  flagged  as 
“no  similar  modes.”  This  computation  can  precede  direct  point 
correspondence  or  alignment  computation.  The  lack  of  modal 
similarity  is  a  strong  clue  that  the  shapes  are  probably  from  dif¬ 
ferent  categories,  and  therefore,  attempting  correspondence  and 
alignment  would  be  unreasonable.  This  method  is  used  in  the 
experiments  with  the  hand  tool  image  database  described  in  the 
next  section. 

4  Experiments  in  Interactive  Search 

In the first set of experiments, our method is used to structure
an image database of fish. The images in this experimental
database were digitized from a children's field guide [22].
Currently, there are 74 images of tropical fish in the database.
Each image depicts a fish from the canonical viewpoint (side view),
though orientation, position, and scale vary. Each fish is
unoccluded and appears on a uniform background. Images for this
and other experiments are available for anonymous FTP from
cs-pub.bu.edu in the compressed tar file sclaroff/pictures.tar.Z.

Figure 5: The five prototype shapes used in the image database
experiment: (a) Squirrel Fish, (b) Spot Fin Butterflyfish, (c) Coney,
(d) Horse Eye Jack, and (e) Southern Sennet.

Figure 6: Six fish had no modes that came within tolerance of matching
modes for the Butterfly Fish prototype in Figure 5(b), and are clearly
not in the Butterfly Fish category.

We  used  the  prototype-based  shape  description  method  formu¬ 
lated  in  Sec.  3.2,  where  each  shape's  strain-energy  distance  to 
the  prototypes  was  precomputed  and  stored  for  interactive  search 
later. First, for each image, a support map and edge image were
computed, a finite-support shape model was built, and then the
eigenmodes  were  determined.  For  the  shapes  in  this  experiment, 
approximately  60-70  finite  element  nodes  were  chosen  so  as  to 
be  roughly-regularly  spaced  across  the  support  region. 

Each  shape  in  the  database  is  then  modal  matched  to  a  set  of 
prototype  images.  There  were  five  fish  prototypes  as  shown  in 
Fig.  5.  These  prototype  images  were  selected  by  a  human  oper¬ 
ator  so  as  to  span  the  range  of  shapes  in  the  database.  For  fish 
prototypes,  we  chose  prototypes  that  span  the  range  from  skinny 
fish  (Fig.  5(e)),  to  fat  fish  (5(b)),  and  from  smooth  fish  (5(c))  to 
prickly  or  pointy-tailed  fish  (5(a,d)). 

Not all shapes in the database have similar modes (similarity
is measured to within a threshold). This information was quickly
determined by using the Hausdorff distance measure described in
Section 3.4. Sometimes, as is shown in Fig. 6, even shapes within
the same category do not have similar modes. In this particular
case, the modes of the wide-bodied Butterfly Fish prototype of
Fig. 5 did not match well with the modes of the most
narrow-bodied fish. Using the more efficient Hausdorff distance, we can
quickly  determine  when  modes  are  nowhere  near  being  similar, 
and  no  attempt  at  alignment  and  strain  energy  computation  is 
made.  Such  shapes  are  simply  flagged  as  being  “not  at  all  simi¬ 
lar”  to  a  particular  shape  prototype,  as  described  in  Section  3.4. 

The  resulting  modal  strain  energy  was  then  used  as  a  similar¬ 
ity  metric  in  Photobook,  an  image  database  management  system 
developed at the MIT Media Lab [42]. Using Photobook, the
user  selected  the  image  at  the  upper  left,  and  the  system  retrieved 
the  remaining  images  sorted  by  strain  energy  (shape  similarity) 
from  left  to  right,  top  to  bottom.  The  similarity  measure  is  shown 
below  each  image. 

The  database  searches  in  Figs.  7  through  10  were  conducted 
using distance in prototype-space. In Fig. 7, a Banded
Butterflyfish was selected. The matches are shown in order, starting
with the most similar. Based on mode-similarity-distance, the system
retrieved the animal shapes that were closest to the Banded
Butterfly Fish shape (other Butterfly Fish, and other fat-bodied fish).
In the second search, shown in Fig. 8, a Trumpet Fish was
selected. In this case, the system retrieved similar long and skinny
fish. 

In  both  searches,  the  fish  judged  “most  similar”  by  the  sys¬ 
tem  appeared  on  the  same  page  in  the  field  guide.  This  type  of 
similarity  judgment  performance  is  an  encouraging  result,  since 
fish  appearing  under  the  same  heading  are  nearly  always  in  the 
same taxonomic category, e.g., Groupers, Jacks, Snappers,
Porgies, Squirrelfishes, Butterflyfishes, Hamlets, or Damselfishes.
In the cases where fish listed under the same heading are not in
the same taxonomic category, it is because they were grouped
together due to some shape similarity, e.g., “Slim-bodied fishes” is
the heading under which the Trumpetfish, Bluespotted Cornetfish,
Balao, Needlefish, Ballyhoo, and Houndfish appear.

Fig.  9  continues  this  example,  this  time  searching  for  shapes 
most  similar  to  a  Crevalle  Jack  and  a  Dog  Snapper.  Again,  the 
matches  are  shown  in  order,  starting  with  the  most  similar.  The 
shapes  most  similar  to  a  Crevalle  Jack  are  other  fish  with  simi¬ 
lar  body  and  tail  shapes.  In  this  case,  the  system  rates  Jolt  Head 
Porgy  over  a  closer  relative  (Yellow  Jack).  This  is  fairly  reason¬ 
able,  since  all  are  closely-related,  open  water  fish. 

In the final example, Fig. 10, the user selected a Dog Snapper.
Again  the  system  rated  fish  from  the  same  pages  in  the  field  guide 
as  “most  similar.”  In  each  example,  search  and  display  took  less 
than  a  second  on  an  HP  735. 

Figure 7: Searching an image database for similarly-shaped fish. In this
example, distance in mode-similarity-space was used as a shape
similarity metric. The figure shows the first of four examples of the
ordering that resulted in searches for similar fish: a Banded Butterfly
Fish. The matches are shown in order, starting with the most similar.
Based on mode-similarity-distance, the system retrieved the animal
shapes that were closest to the Banded Butterfly Fish shape (other
Butterfly Fish, and other fat-bodied fish). The fish judged “most
similar” by the system appeared on the same page in the original field
guide book, and in the same taxonomic class.

Database queries were performed for each of the 72 fish images
in the database for which there were other fish under the same
heading in the field guide. Overall, another fish under the same
heading in the field guide was judged as most similar 71% of
the time. To gain enhanced performance in capturing animal
taxonomies, we suspect that modal matching would need to be part
of a combined system that includes local feature and color
information.
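
The "judged as most similar" statistic reported above can be computed along the following lines. This is a sketch under the assumption that each shape is represented by a feature vector and a category label (here, the field-guide heading); the function name `top1_same_category_rate` is hypothetical.

```python
import numpy as np

def top1_same_category_rate(features, labels):
    """Fraction of queries whose nearest other shape shares their label.

    features: (s, n) array of coordinates, one row per shape.
    labels:   length-s sequence of category labels.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)      # a query never retrieves itself
    nearest = d.argmin(axis=1)       # index of the most similar shape
    return float(np.mean(y[nearest] == y))
```

Shapes with no other member under their heading would need to be excluded from the queries, as was done for the 72 fish above.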

For comparison, the same 72 queries were performed using
moment invariants based on second- and third-order moments
[15]. To gain better performance, the covariance matrix for
the seven-dimensional feature vectors was computed and shapes
were ordered in terms of their Mahalanobis distances to the
selected shape. In this case another fish under the same heading in
the field guide was judged as most similar 57% of the time.

Figure 8: Searching an image database for similarly-shaped fish
(continued): Trumpet Fish. The matches are shown in order, starting with
the most similar. In this second search the system retrieved similar
long and skinny fish that are on the same page in the original field
guide.

4.1  Evaluating  Retrieval  Accuracy  using  AVRR 

Thus  far  retrieval  performance  has  been  measured  in  terms  of 
the  percentage  of  times  that  a  shape  in  the  same  category  is  re¬ 
trieved  as  “most  similar”  over  a  number  of  trials.  However,  the 
Photobook  system,  and  other  query  by  example  (QBE)  image 
database  systems  (IBM,  Virage,  Jacob)  provide  a  list  of  possi¬ 
ble  matches  ordered  in  terms  of  their  similarity  distance  from  the 
example  image.  This  is  in  contrast  to  retrieval  systems  based  on 
“exact  match.” 

In  “exact  match”  systems  the  standard  measures  of  precision 
and recall can be employed [46]. However, as noted by Faloutsos,
et  al.  [20],  systems  that  offer  a  list  of  items  sorted  by  similarity 
do  not  fall  under  the  rubric  of  exact  matching.  We  need  a  perfor¬ 
mance  measure  that  embodies  the  positions  in  which  target  items 
appear  in  the  retrieval.  Ideally,  if  there  were  a  total  of  n  items 
of  the  same  category  in  the  database,  then  these  n  items  would 
appear  in  the  first  n  positions  for  a  similarity-based  retrieval. 

Figure 9: Ordering that resulted in searches for similar fish (continued):
a Crevalle Jack. The shapes most similar to a Crevalle Jack are other fish
with similar body shapes and pointed tails (other open water fish).

To evaluate retrieval performance we will employ the normalized
recall metric developed to evaluate IBM's QBIC [20]. Assume
that the number of categories, shapes per category, and category
membership for each shape are known. We can measure
the average rank of all relevant items (AVRR) for a particular
retrieval and then compare this with the ideal average rank (IAVRR)
when all n images from a particular shape category appear in the
first n positions. For a database that contains n shapes in each
category, IAVRR = n/2. In general, the equation for ideal average
rank is

$$\mathrm{IAVRR} = \frac{1}{2m} \sum_{i=1}^{c} n_i^2, \qquad (12)$$

where m is the total number of shapes in the database, c is the
number of categories, and $n_i$ is the number of shapes in the ith
category. The AVRR is computed based on the actual ordered
ranking of shapes for each database retrieval. Thus the ratio of
AVRR to IAVRR can be used to give a measure of average
retrieval accuracy over a number of experimental trials.
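
The two quantities can be sketched in Python as follows. The rank convention is an assumption chosen to agree with the figures reported in this paper: the query itself is excluded from the ranked list and the most similar item has rank 1, so a category of 21 shapes has an ideal average rank of 10.5. The function names and the averaging of the ideal rank over all queries are likewise assumptions.

```python
def avrr(ranked_labels, query_label):
    """Average rank of the relevant items in one retrieval.

    ranked_labels: category labels of the retrieved items in ranked
    order (most similar first), with the query itself excluded.
    """
    ranks = [rank for rank, label in enumerate(ranked_labels, start=1)
             if label == query_label]
    if not ranks:
        raise ValueError("no relevant items in the ranked list")
    return sum(ranks) / len(ranks)

def iavrr(category_sizes):
    """Ideal average rank, averaged over one query per database shape.

    A query from a category of size n ideally finds its n - 1 relatives
    at ranks 1..n-1, whose mean is n / 2.
    """
    m = sum(category_sizes)
    return sum(n * n for n in category_sizes) / (2 * m)
```

For example, `iavrr([21, 21, 21])` gives 10.5, and a perfect retrieval for one of those queries, `avrr(["a"] * 20 + ["b"] * 42, "a")`, gives the same value, so the ratio AVRR/IAVRR is 1 in the ideal case.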

Using this measure, the retrieval accuracy was evaluated for the
previously described experiments with the fish image database in
Photobook. The IAVRR for this database was 3.4 and the AVRR
was 8.9. This means that on average the relevant image appears
in the ninth position. The ratio of AVRR/IAVRR = 2.6. In
contrast, the AVRR was 17.7 and AVRR/IAVRR = 5.2 when moment
invariants were used.

Figure 10: Ordering that resulted in search for fish similar to a Dog
Snapper. The system rated fish from the same page in the field guide
as “most similar.” In each example, search and display took less than a
second on an HP 735.

4.2  Tool  Image  Database 

In  a  second  set  of  image  database  experiments  we  used  a 
database  of  63  grayscale  images  of  real  and  toy  hand  tools.  There 
were  21  images  from  each  of  three  tool  categories:  wrenches, 
hammers, and crescent wrenches. Figure 11 shows example images
taken from this database. Note that because the toy tools
were  made  of  plastic,  they  could  be  bent  in  various  ways.  Fur¬ 
ther,  tools  appeared  in  a  number  of  orientations  and/or  scales, 
with  varying  lighting.  The  tools  were  placed  on  a  uniform  back¬ 
ground  so  that  a  simple  fuzzy  c-Means  clustering  technique  could 
be  used  for  foreground/background  separation  [31]. 
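
For readers unfamiliar with fuzzy c-means, a generic one-dimensional version applied to pixel intensities is sketched below. It is not the specific technique of [31]; the function name, the two-cluster default, the fuzziness exponent of 2, and the fixed iteration count are all assumptions made for illustration.

```python
import numpy as np

def _memberships(x, centers, fuzziness):
    """Fuzzy membership of each value in each cluster (rows sum to 1)."""
    dist = np.abs(x[:, None] - centers[None, :]) + 1e-12  # avoid divide by zero
    inv = dist ** (-2.0 / (fuzziness - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_c_means_1d(values, num_clusters=2, fuzziness=2.0, iterations=50):
    """Generic fuzzy c-means on scalar values such as pixel intensities.

    Returns (centers, memberships); memberships has shape
    (num_values, num_clusters).
    """
    x = np.asarray(values, dtype=float).ravel()
    centers = np.linspace(x.min(), x.max(), num_clusters)  # spread initial centers
    for _ in range(iterations):
        weights = _memberships(x, centers, fuzziness) ** fuzziness
        centers = (weights * x[:, None]).sum(axis=0) / weights.sum(axis=0)
    return centers, _memberships(x, centers, fuzziness)
```

Taking the larger membership at each pixel then yields a binary foreground/background labeling, which could serve as the support map for a shape on a uniform background.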

For the shapes in this experiment, approximately 70-80 finite
element nodes were chosen so as to be roughly-regularly spaced
across the support region. Mode amplitudes for the first 32 modes
were recovered and used to warp each prototype onto the other
tools. As in the fish database experiments, the Hausdorff distance
method was used to cull cases where no modes matched. The
comparisons were made translation and rotation invariant by
ignoring displacements in the rigid body modes. Comparisons were
made scale invariant by recovering the scale factor before
nonrigidly warping the shape to each prototype [25]. Total CPU time
for database precomputation (match, align, and store n-tuple)
averaged 3 seconds per prototype on an SGI Indigo2 workstation.

Figure 11: Some example images from the hand tools experimental
image database. There are 63 images of children's toy tools and adult
tools in the database, 21 each of category hammer, single-ended wrench,
and double-ended wrench. Because the toy tools were made of plastic,
they could be bent in various ways. Further, tools appeared in a number
of orientations and/or scales, with varying lighting.

Matching  experiments  were  then  conducted  using  the  coordi¬ 
nates  produced  via  the  orthogonalization  procedure  in  Section 
3.3. Database queries were performed for each of the 63 tool
images  in  the  database.  Overall,  another  tool  from  the  same  cate¬ 
gory  was  judged  as  most  similar  94%  of  the  time,  compared  with 
86%  for  the  moments-based  method. 

For orthogonalized strain-from-prototypes, the AVRR was
18.2; this means that the average relevant image appears in
roughly the eighteenth position. The IAVRR for this database
is 10.5. Thus the ratio of AVRR/IAVRR = 1.7. The moments-based
method produced AVRR = 23.1 and AVRR/IAVRR = 2.2.
As another point of comparison, performance for shape-based
search in QBIC was reported to have an AVRR/IAVRR ratio of
1.8 in [20] for a database of 777 airplane silhouettes coarsely
categorized by viewpoint and overall shape properties.

Figure 12: Graph showing how the number of prototypes affects the
average performance for retrieval given a database of 63 handtools, 21
tools from each of three categories. For each trial, n prototypes were
chosen at random from the database, and searches conducted in
orthogonalized strain-to-prototypes space. For each n there were up to
1000 trials. Database queries were performed for each of the 63 tool
images in the database. With only one prototype, another tool from the
same category was judged as most similar 56% of the time. As the number
of prototypes reached four, performance began to level off.

4.3  Number  of  Prototypes  and  Retrieval  Accuracy 

Using  the  tool  image  database,  an  experiment  was  conducted 
to  evaluate  retrieval  accuracy  as  a  function  of  the  number  of  pro¬ 
totypes  used.  Multiple  trials  were  conducted  using  between  one 
and  ten  prototypes.  These  n  prototypes  were  selected  at  random 
(uniformly distributed), in 1000 trials for each n. Average
matching performance was evaluated using the coordinates produced via
the orthogonalization procedure in Section 3.3. In each trial,
database  queries  were  performed  for  each  of  the  63  tool  images 
in  the  database. 

Figures 12 and 13 show the resulting performance curves. The
graph  in  Figure  12  shows  how  the  number  of  prototypes  affects 
the  average  performance  for  database  queries  performed  for  each 
of  the  63  tool  images  in  the  database.  With  only  one  prototype, 
another  tool  from  the  same  category  was  judged  as  most  simi¬ 
lar  56%  of  the  time.  As  the  number  of  prototypes  reached  four, 
performance  began  to  level  off  at  approximately  90%. 

The  graph  in  Figure  13  shows  how  the  number  of  prototypes 
affects  the  average  AVRR  for  retrieval  of  the  21  handtools  in  the 
same tool category. With only one prototype, the AVRR
averaged 28.1. The average performance leveled out at 5 prototypes
where  AVRR  =  21.9.  The  ideal  AVRR  would  be  10.5.  The  ratio 
AVRR/IAVRR  is  greater  than  two. 
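
The structure of this experiment can be sketched as a small simulation loop. The sketch assumes that a matrix of precomputed strain energies between every pair of shapes is available, so that choosing a prototype amounts to choosing a column; it also omits the orthogonalization step for brevity. The names `top1_rate` and `prototype_count_curve` are hypothetical.

```python
import numpy as np

def top1_rate(features, labels):
    """Fraction of queries whose nearest other shape shares their label."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return float(np.mean(labels[d.argmin(axis=1)] == labels))

def prototype_count_curve(pairwise_strain, labels, counts, trials, seed=0):
    """Mean top-1 rate for each prototype count, over random selections.

    pairwise_strain[s, p]: strain energy aligning shape p (used as a
    prototype) with shape s; assumed to be precomputed.
    """
    rng = np.random.default_rng(seed)
    S = np.asarray(pairwise_strain, dtype=float)
    y = np.asarray(labels)
    curve = {}
    for n in counts:
        rates = []
        for _ in range(trials):
            # Draw n distinct prototypes uniformly at random.
            chosen = rng.choice(S.shape[0], size=n, replace=False)
            rates.append(top1_rate(S[:, chosen], y))
        curve[n] = float(np.mean(rates))
    return curve
```

Plotting the returned dictionary against n would give a curve of the same kind as Figure 12; replacing `top1_rate` with an average-rank measure would give the analogue of Figure 13.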



Figure  13:  Graph  showing  how  the  number  of  prototypes  affects  the 
average  rank  (AVRR)  for  retrieval  of  the  21  handtools  in  the  same  tool 
category.  For  each  trial,  n  prototypes  were  chosen  at  random  from  the 
database,  and  searches  conducted  in  orthogonalized  strain-to-prototypes 
space.  For  each  n  there  were  up  to  1000  trials.  With  only  one  prototype, 
the  AVRR  averaged  28.1.  The  average  performance  leveled  out  at  5 
prototypes  where  AVRR  =  21.9.  The  ideal  AVRR  would  be  10.5. 

5  Discussion 

One  of  the  main  motivations  for  this  research  was  to  pro¬ 
vide  improved  shape  representations  for  query  by  image  content. 
While  the  shape  comparison  algorithms  developed  in  the  machine 
vision  and  pattern  recognition  communities  can  serve  as  a  good 
starting point for developing shape-based image database search
methods,  retrieval  by  shape  is  still  considered  to  be  one  of  the 
most  difficult  aspects  of  content-based  image  search  [20]. 

IBM's  Query  By  Image  Content  system  (QBIC)  [20;  36]  is 
perhaps  the  most  advanced  image  database  system  to  date;  it  is 
available  as  a  commercial  product.  QBIC  can  perform  searches 
that  combine  information  about  shape,  color,  and  texture.  As  in¬ 
put,  the  system  assumes  non-occluded,  planar  shapes  that  are  rep¬ 
resented  as  a  binary  image.  Shape-based  search  in  QBIC  cannot 
deal  well  with  nonrigid  deformation.  Algebraic  moment  invari¬ 
ants  [51]  were  intended  for  modeling  rigid  objects  only.  In  addi¬ 
tion,  the  higher  moments  are  dominated  by  points  that  are  farthest 
from  the  centroid;  therefore,  they  are  highly  susceptible  to  out¬ 
liers.  Similar  moments  do  not  necessarily  guarantee  perceptually 
similar  shapes. 

Other  shape  indexing  schemes  have  been  based  on  local 
boundary  features  [34;  23],  and  are  therefore  not  very  robust  to 
noise,  scale,  and  sampling.  Another  system,  proposed  by  Chen 
[9]  identified  2-D  aircraft  shapes  using  elliptic  Fourier  descrip¬ 
tors.  Because  it  is  Fourier  descriptor-based,  Chen's  system  suf¬ 
fers  from  problems  with  sampling  and  parameterization.  Jagadish 
introduced  a  multidimensional  indexing  scheme  that  offered  the 
advantage  that  it  could  index  images  much  faster  than  previous 
techniques [27]. However, the system had limited descriptive
power, because the shape similarity measure was too simple (the
area difference between two shapes) and the underlying shape
representation  was  polyhedral  (representing  shapes  in  terms  of 
K-d-b  trees  of  overlapping  minimum  bounding  rectangles). 

In  contrast  to  previous  formulations,  the  FEM  integrals  used  in 
the  modal  model  formulation  provide  greater  robustness  to  sam¬ 
pling,  outliers,  and  missing  data.  Furthermore,  modal  models 
provide  quasi-invariance  to  different  types  of  nonrigid  deforma¬ 
tion,  while  also  providing  an  ordered,  orthogonal,  encoding  of 
the  nonrigid  deformation  that  relates  a  candidate  shape  to  a  shape 
prototype  or  shape  category. 

5.1  Matching  Human  Similarity  Judgments 

For an image database search to be useful, it is critical that the
shape  similarity  metric  be  able  to  match  human  judgments  of 
similarity.  This  is  not  to  say  that  the  computation  must  some¬ 
how  mimic  the  human  visual  system;  but  rather  that  computer 
and  human  judgments  of  similarity  must  be  generally  correlated. 
Without  this,  the  images  the  computer  finds  will  not  be  those  de¬ 
sired  by  the  human  user. 

For  human  shape  similarity  judgments,  sometimes  scale  and 
rotation  invariance  are  important,  other  times  not  [35];  it  is  there¬ 
fore  desirable  to  duplicate  this  performance  in  our  image  database 
search  algorithms.  In  QBIC,  a  weighted  metric  allows  for  subset 
selection,  and  thus  it  provides  selective  invariance  to  size  and  ori¬ 
entation  [20].  Modal  matching  also  provides  this  invariance  to 
size  and  orientation,  but  unlike  any  of  the  shape  representations 
used  in  QBIC,  modal  representations  can  also  be  made  invariant 
to  affine  deformations,  and  thus  selectively  invariant  to  changes 
in camera viewpoint. More importantly, the modal representation
provides deformation “control knobs” that correspond qualitatively
with human notions of perceptual shape similarity [4;
41].  Shape  is  thought  of  in  terms  of  an  ordered  set  of  deforma¬ 
tions  from  an  initial  shape:  starting  with  bends,  tapers,  shears, 
and  moving  up  towards  higher-frequency  shape  variations. 

5.2  Speed  of  Image  Database  Search 

Another concern in image database search is the computation
speed. Shape-based image database search must be efficient
enough  to  be  interactive.  A  search  that  requires  minutes  per  im¬ 
age  is  simply  not  useful  in  a  database  with  millions  of  images. 
Furthermore,  interactive  search  speed  makes  it  possible  for  users 
to  recursively  refine  a  search  by  selecting  examples  from  the  cur¬ 
rently  retrieved  images  and  using  these  to  initiate  a  new  select- 
sort-display  cycle.  Thus  users  can  iterate  a  search  to  quickly 
“zero  in  on”  what  they  are  looking  for. 

As demonstrated in our image database experiments, searches
on databases of over one hundred images take less than a second
(including image display) on an HP 735 workstation. In addition,
search time in our system scales linearly with the number of shapes
in the database. Finally, it is possible that the notion of
building up prototype-based modal categories could be exploited to
structure databases into taxonomic trees, thereby improving the
computational complexity of image database search.

6  Conclusion 

A  new  image  database  search  method  has  been  described.  The 
method  uses  strain  energy  from  deformable  prototypes  to  encode 
shape  categories.  Retrieval  accuracy  of  this  approach  has  been 
demonstrated  in  a  series  of  experiments  with  image  databases  of 
animals  scanned  from  children's  field  guides  and  of  deformable 
hand  tools  digitized  via  a  video  camera.  In  these  experiments, 
the  method  performed  consistently  better  than  search  on  moment 
invariants.  Experiments  were  also  conducted  to  evaluate  retrieval 
accuracy  as  a  function  of  the  number  of  prototypes  used.  Rela¬ 
tively  few  prototypes  were  needed  to  produce  stable  performance 
when  a  new  orthogonalization  scheme  was  employed. 